Machine Learning Overview

Machine Learning is a way to let a computer learn patterns from historical data.

In traditional programming, we write the rules ourselves:

def likely_to_buy(event_name):
    # Hand-written rule: an add_to_cart click suggests the user may want to buy.
    return event_name == "add_to_cart"

In machine learning, we prepare examples and let the model learn the pattern:

past user behavior + purchase result -> model learns the pattern

For data engineers, the important idea is:

Machine learning depends on clean, well-organized data.

If the data is messy, duplicated, incomplete, or modeled with the wrong meaning, the model's predictions will be unreliable too.

Python Toolset

For this beginner section, we will use:

Python
pandas
scikit-learn

Install the main packages:

pip install pandas scikit-learn

These tools are enough to learn the basic workflow:

raw data -> pandas DataFrame -> feature table -> scikit-learn model -> evaluation

Example Scenario

We will use an e-commerce behavior example.

The raw data may contain events like:

session_start
view_item
add_to_cart
purchase
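As a rough illustration, raw events like these could be held in a pandas DataFrame. The column names (user_id, event_name, event_time) and every sample row are assumptions invented for this sketch, not a real schema.

```python
import pandas as pd

# Hypothetical raw event log; column names and values are invented for illustration.
events = pd.DataFrame(
    {
        "user_id": [
            "user_001", "user_001", "user_001", "user_001",
            "user_002", "user_002",
        ],
        "event_name": [
            "session_start", "view_item", "add_to_cart", "purchase",
            "session_start", "view_item",
        ],
        "event_time": pd.to_datetime(
            [
                "2024-01-01 10:00", "2024-01-01 10:02",
                "2024-01-01 10:05", "2024-01-01 10:09",
                "2024-01-02 09:00", "2024-01-02 09:01",
            ]
        ),
    }
)

# Count how often each event type appears in the log.
event_counts = events["event_name"].value_counts()
print(event_counts)
```

Each row here is one event, not one user, so this table still has to be aggregated before it can be used for training.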

The business question is:

Can we predict whether a user will purchase soon?

This is called a purchase propensity problem.

Basic Terms

Row

A row is one training example.

For this example, one row should represent one user.

one user = one row

Feature

A feature is an input column used by the model.

Examples:

How many sessions did the user have?
How many products did the user view?
How many times did the user add items to cart?
How many days since the user's last activity?
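The questions above can be turned into columns by aggregating events per user, so that one user becomes one row. This is a minimal sketch: the event log, its column names, and the as_of reference date are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical raw event log (invented values).
events = pd.DataFrame(
    {
        "user_id": [
            "user_001", "user_001", "user_001", "user_001", "user_001",
            "user_002", "user_002",
        ],
        "event_name": [
            "session_start", "view_item", "view_item", "add_to_cart", "session_start",
            "session_start", "view_item",
        ],
        "event_time": pd.to_datetime(
            [
                "2024-01-01 10:00", "2024-01-01 10:02", "2024-01-01 10:03",
                "2024-01-01 10:05", "2024-01-05 08:00",
                "2024-01-02 09:00", "2024-01-02 09:01",
            ]
        ),
    }
)

# Reference date for "days since last activity" (an assumption for this sketch).
as_of = pd.Timestamp("2024-01-10")

# Count each event type per user: one row per user, one column per event type.
counts = pd.crosstab(events["user_id"], events["event_name"])
features = counts.reindex(
    columns=["session_start", "view_item", "add_to_cart"], fill_value=0
).rename(
    columns={
        "session_start": "session_count",
        "view_item": "view_count",
        "add_to_cart": "add_to_cart_count",
    }
)
features.columns.name = None

# Whole days between each user's latest event and the reference date.
last_seen = events.groupby("user_id")["event_time"].max()
features["days_since_last_activity"] = (as_of - last_seen).dt.days

print(features)
```

The result is indexed by user_id with one row per user, which is the shape a scikit-learn model expects as input.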

Label

A label is the answer we want the model to learn.

For this example:

label = did this user purchase?

The label can be:

1: yes, the user purchased
0: no, the user did not purchase
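One simple way to derive this 1/0 label is to check whether each user has at least one purchase event. The sketch below assumes a hypothetical event log with user_id and event_name columns, and it ignores the "soon" part of the business question; restricting the label to a time window would be a further assumption.

```python
import pandas as pd

# Hypothetical event log (invented values).
events = pd.DataFrame(
    {
        "user_id": ["user_001", "user_001", "user_002", "user_003", "user_003"],
        "event_name": ["view_item", "purchase", "view_item", "add_to_cart", "purchase"],
    }
)

# True for rows that are purchase events.
is_purchase = events["event_name"].eq("purchase")

# A user gets label 1 if any of their events is a purchase, otherwise 0.
label = is_purchase.groupby(events["user_id"]).any().astype(int).rename("label")

print(label)
```

When the label is built this way, purchase events themselves should not also be counted as a feature, otherwise the model would simply be handed the answer.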

Model

A model is the result of training: a learned mapping from features to a prediction.

After training, the model can receive new user behavior and output a prediction.

Prediction

A prediction is the model's output for a new example. For a yes/no question, it is often expressed as a probability.

For example:

user_001 -> 0.82 probability of purchase
user_002 -> 0.13 probability of purchase
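Probabilities like these can be produced with scikit-learn's predict_proba. The sketch below trains a LogisticRegression on a tiny, invented feature table; the choice of algorithm, the column names, and all values are assumptions for illustration, so the printed numbers will not match the figures above.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical training table: one row per user, features plus label (invented values).
train = pd.DataFrame(
    {
        "session_count": [1, 2, 5, 7, 1, 6, 2, 8],
        "view_count": [2, 3, 12, 20, 1, 15, 4, 18],
        "add_to_cart_count": [0, 0, 2, 3, 0, 2, 0, 4],
        "label": [0, 0, 1, 1, 0, 1, 0, 1],
    }
)
feature_cols = ["session_count", "view_count", "add_to_cart_count"]

# Training: the model learns a pattern from features to label.
model = LogisticRegression(max_iter=1000)
model.fit(train[feature_cols], train["label"])

# New users the model has not seen before (invented values).
new_users = pd.DataFrame(
    {"session_count": [6, 1], "view_count": [14, 2], "add_to_cart_count": [3, 0]},
    index=["user_001", "user_002"],
)

# predict_proba returns one column per class; column 1 is the probability of label 1.
purchase_probability = pd.Series(
    model.predict_proba(new_users[feature_cols])[:, 1],
    index=new_users.index,
    name="purchase_probability",
)
print(purchase_probability.round(2))
```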

Machine Learning Workflow

In this basic course, we will focus on:

Data preparation
Feature engineering
Model training
Model evaluation
Prediction
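The five steps can be sketched end to end as follows. Everything here is an assumption for illustration: the data is synthetic, the relationship between cart activity and purchases is invented, and LogisticRegression, accuracy, and ROC AUC are just one reasonable set of choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# 1. Data preparation: synthetic per-user data (one row per user).
rng = np.random.default_rng(42)
n_users = 400
data = pd.DataFrame(
    {
        "view_count": rng.poisson(5, n_users),
        "add_to_cart_count": rng.poisson(1, n_users),
    }
)
# Invented rule for the synthetic label: more cart adds, higher purchase chance.
purchase_chance = 1 / (1 + np.exp(-2 * (data["add_to_cart_count"] - 1.5)))
data["label"] = (rng.random(n_users) < purchase_chance).astype(int)

# 2. Feature engineering: derive a new column from existing ones.
data["cart_per_view"] = data["add_to_cart_count"] / (data["view_count"] + 1)
feature_cols = ["view_count", "add_to_cart_count", "cart_per_view"]

# 3. Model training: fit only on the training split.
X_train, X_test, y_train, y_test = train_test_split(
    data[feature_cols], data["label"],
    test_size=0.25, random_state=0, stratify=data["label"],
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 4. Model evaluation: score on users the model did not see during training.
test_probability = model.predict_proba(X_test)[:, 1]
accuracy = accuracy_score(y_test, model.predict(X_test))
auc = roc_auc_score(y_test, test_probability)
print(f"accuracy={accuracy:.2f}, roc_auc={auc:.2f}")

# 5. Prediction: apply the same feature logic to a new user.
new_user = pd.DataFrame({"view_count": [6], "add_to_cart_count": [3]})
new_user["cart_per_view"] = new_user["add_to_cart_count"] / (new_user["view_count"] + 1)
new_probability = model.predict_proba(new_user[feature_cols])[0, 1]
print(f"purchase probability for new user: {new_probability:.2f}")
```

Holding out a test split matters because a score measured on the training rows would overstate how well the model handles new users.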

Common Machine Learning Types

Classification

Classification predicts a category.

Examples:

Will the user purchase? yes or no
Is this transaction fraud? yes or no
Is this email spam? spam or not spam

Our purchase prediction example is a classification problem.

Regression

Regression predicts a number.

Examples:

How much revenue will we get tomorrow?
How long will delivery take?
What will the house price be?
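As a small sketch of regression, the delivery-time question could be modeled with scikit-learn's LinearRegression. The single feature (distance in km) and all data points are invented; the sample times follow 10 + 5 * km exactly, so that the fitted line is easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: delivery distance in km and delivery time in minutes.
distance_km = np.array([[1.0], [2.0], [3.0], [5.0], [8.0]])
delivery_minutes = np.array([15.0, 20.0, 25.0, 35.0, 50.0])  # 10 + 5 * km

# Fit a straight line from distance to time.
model = LinearRegression().fit(distance_km, delivery_minutes)

# Predict a number (minutes) for a distance not in the training data.
predicted_minutes = model.predict(np.array([[4.0]]))[0]
print(f"predicted delivery time for 4 km: {predicted_minutes:.1f} minutes")
```

Unlike the purchase example, the output here is a quantity rather than a yes/no category.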

Clustering

Clustering groups similar data together.

Examples:

Group users by behavior
Group products by buying pattern
Group articles by topic
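Grouping users by behavior could look like the sketch below, which uses scikit-learn's KMeans. The two behavior columns, the sample values, and the choice of two clusters are assumptions made for the example. Note that no label column is involved.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented per-user behavior: [sessions per week, items viewed per session].
behavior = np.array(
    [
        [1.0, 2.0], [2.0, 1.0], [1.0, 1.0],        # light users
        [10.0, 12.0], [11.0, 10.0], [12.0, 11.0],  # heavy users
    ]
)

# Ask for two groups; the algorithm decides which user goes where.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(behavior)

# Cluster numbers are arbitrary names; only the grouping itself is meaningful.
print(cluster_ids)
```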

Why Data Engineers Should Learn This

Data engineers may not train models themselves, but they often build the data foundation that machine learning depends on.

Common data engineering responsibilities include:

Collect raw data from systems
Clean and transform data
Build reliable feature tables
Schedule pipelines
Monitor data quality
Deliver data to analysts, data scientists, or ML systems
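Monitoring data quality can start with a few simple checks in pandas. The sketch below counts duplicate rows, missing user IDs, and unexpected event names. The sample log, its column names, and the set of expected events are assumptions for illustration; a real pipeline would define its own rules.

```python
import pandas as pd

# Hypothetical event log with deliberate problems (invented values).
events = pd.DataFrame(
    {
        "user_id": ["user_001", "user_001", "user_002", None],
        "event_name": ["view_item", "view_item", "checkout", "purchase"],
        "event_time": pd.to_datetime(
            [
                "2024-01-01 10:00", "2024-01-01 10:00",
                "2024-01-02 09:00", "2024-01-03 12:00",
            ]
        ),
    }
)

# Event names this pipeline expects to see (an assumption for this sketch).
expected_events = {"session_start", "view_item", "add_to_cart", "purchase"}

report = {
    # Rows that exactly repeat an earlier row.
    "duplicate_rows": int(events.duplicated().sum()),
    # Events that cannot be attributed to any user.
    "missing_user_id": int(events["user_id"].isna().sum()),
    # Event names outside the expected set.
    "unknown_event_names": sorted(set(events["event_name"]) - expected_events),
}
print(report)
```

Checks like these can run on a schedule before feature tables are built, so that bad data is caught before it reaches a model.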

Machine learning projects often fail because the data pipeline is weak, not because the algorithm is.
